Abstract



We develop a mathematical model to compute the minimum
communication cost of a join-semijoin program for processing a
given equi-join query. We state the definitions and conditions upon
which this paper is based. We define a query processing graph for
each equi-join query and characterize the set of join-semijoin
programs which solve this query. A rule for estimating the size of
a derived relation is derived, and the parameters used for this
estimate form a consistent parameter system. Under the assumption
of communication cost dominance, the cost functions are linear in
the size of the data transmitted. The optimization problem for
distributed query processing is thereby well formulated.
1. Introduction

Query processing in distributed relational databases
corresponds to the translation of requests, formulated in a
nonprocedural, relational-calculus-like language, into a sequence
of relational algebra operations which retrieve and update data
stored in distributed database management systems (DDBMSs).

Given a database schema $D = \{R_1, R_2, \ldots\}$, a query can
usually be written as a number of alternative algebraic
expressions. In particular, each query can be put in the
following form:

$Q = \pi_{TL}\,\sigma_q (R_1 \times R_2 \times \cdots \times R_n)$

where TL contains the attributes in the answer relation, q is a
predicate, and each $R_i$ is a relation. Usually, TL is referred to
as the target list and q as the qualification of a query. We
shall assume all queries are expressed in this canonical form,
denoted by Q=(q, TL).

In distributed query processing, the execution of a query
involves data transmissions which may take significant time in
comparison with the subquery and elementary operation execution
times. We assume that the data communication costs dominate the
local processing costs, so the local processing costs of a query
(e.g. the costs of selection and projection) are negligible. In this
paper, our objective is to minimize the total data transmission
cost for processing a query.




For a query Q=(q,TL), let $\{R_1, R_2, \ldots, R_n\}$ be the set of
relation schemas referenced by q and let X be the set of
attributes appearing in q. Before processing the query, we can
project each relation $R_i$ over the attributes $(X \cup TL) \cap R_i$.
We then execute those subqueries which reference only one local
relation.

A query Q=(q, TL) is a conjunctive equi-join query if the
qualification q is a conjunction of equi-join clauses of the form
$(R_i.X = R_j.Y)$, where X and Y are subsets of the attributes of
$R_i$ and $R_j$ respectively.

In this paper, we restrict our study to the class of equi-join
queries. Although it is only a subset of the complete relational
calculus language, it is a rich and large class of queries in practice.

Data transmission is required when two relations that must
be joined reside at different sites. One way to perform the join
is to move an entire relation from one site to the other. The other
way is to replace the join by semijoins followed by a join.
Assume $R_1$ and $R_2$ are at different sites and we want to join
$R_1$ and $R_2$ at the site of $R_2$. Using the semijoin strategy, one
can send the projection of $R_2$ on its joining columns to $R_1$'s
site and perform a semijoin to reduce $R_1$ by $R_2$ before sending
$R_1$ to $R_2$'s site. This is a profitable tactic only when the
projection of $R_2$ on its joining columns is smaller than the
amount by which $R_1$ is reduced by the semijoin.
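The trade-off between the two tactics can be illustrated with a small numeric sketch. The function names and all figures below are hypothetical and are not taken from the paper; the sketch only assumes a cost proportional to the number of bytes shipped.

```python
def full_transfer_cost(c, rows_r1, width_r1):
    """Cost of shipping all of R1 to R2's site (bytes times unit cost c)."""
    return c * rows_r1 * width_r1


def semijoin_strategy_cost(c, rows_proj_r2, width_join, rows_r1_reduced, width_r1):
    """Cost of shipping the projection of R2 to R1's site, then the reduced R1 back."""
    ship_projection = c * rows_proj_r2 * width_join
    ship_reduced_r1 = c * rows_r1_reduced * width_r1
    return ship_projection + ship_reduced_r1


# Hypothetical figures: R1 has 10,000 rows of 100 bytes, of which 2,000 survive
# the semijoin; the projection of R2 has 500 rows of 8 bytes.
full = full_transfer_cost(c=1, rows_r1=10_000, width_r1=100)
semi = semijoin_strategy_cost(c=1, rows_proj_r2=500, width_join=8,
                              rows_r1_reduced=2_000, width_r1=100)
print(full, semi)  # 1000000 204000
```

With these assumed numbers the projection costs 4,000 bytes to ship and saves 800,000 bytes, so the semijoin tactic pays off.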
Prior work in distributed query processing
[WONG77, GBWRR80, CHIU79, HY79] was either limited to strategies
that perform semijoins first and then joins, or lacked a consistent
parameter system for estimating the size of a derived relation.

In this paper, we first state the definitions and conditions
upon which this paper is based. We then define a query
processing graph for each equi-join query and derive a theorem
about the set of join-semijoin programs which solve this query.
Next, we define a rule for estimating the size of a derived
relation and prove that the parameter system we define is
consistent. With the definition of cost functions, we develop a
mathematical model to compute the minimum communication cost of a
join-semijoin program for processing a given query.

2.  Query  Processing  Model 

A query Q specified by a qualification q over the relations
$R_1, R_2, \ldots, R_n$ and by a target list TL can be decomposed into a
set of operations $\{p_1, p_2, \ldots, p_l\}$ which will produce the answer to
the query, where each $p_i$ is a relational algebra operator.
In general, a query can be decomposed into several different
execution sequences which produce the same answer. We call
such an execution sequence a strategy. Let S(Q) denote the set of
strategies which answer the query Q. The goal is to minimize the
overall cost of executing the query Q. We can formulate this
problem as
$$\min_{P \in S(Q)} f(P, D[0]) = \sum_{i=1}^{l} f_i(p_i, D[i-1])$$

s.t. $P = p_1 p_2 \cdots p_l$

$D[i] = p_i(D[i-1])$

$D[0]$ is the initial database state


2.1  Definitions  and  Assumptions 

We assume that the distributed database management system
(DDBMS) consists of a collection of interconnected computers
$S_1, S_2, \ldots, S_n$ at different sites. Each computer, known as a
node in the network, contains a DBMS. Data are logically viewed in
the relational model. By the universal relation interpretation
[BFMMUY81], we assume that each site holds only one
relation.

Data transmission in the network is via communication links.
We assume that the transmission cost to send one byte of data
between any two sites i and j is known and equal to $c_{ij}$. So the
cost of transmitting data of volume V between two sites
i and j is the linear function $C_{ij}(V) = c_{ij} V$. We assume that all
possible subqueries involving data at a single site are
preprocessed; this we call local processing. The effect of local
processing is to reduce the amount of data that needs further
processing. After local processing, the following parameters of
the query can be defined.


$n$ = number of sites (i.e. relations) in the remaining query
$d_i = |R_i|$, the number of attributes of the relation $R_i$ at site i
$X_{ij} = R_i \cap R_j$, the set of attributes of the joining domains
between $R_i$ and $R_j$
$r_i$ = number of tuples in relation $R_i$
$w(A)$ = the width of a data item of attribute A in relation $R_i$
$s_i = r_i \sum_{A \in R_i} w(A)$, the size of the relation $R_i$
$w(X_{ij}) = \sum_{A \in X_{ij}} w(A)$
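As a small illustration of the size parameters $s_i$ and $w(X_{ij})$, the following sketch computes them from per-attribute widths. The helper names, attribute names and widths are hypothetical, chosen only for the example.

```python
def relation_size(num_rows, widths):
    """s_i = r_i * sum of w(A) over the attributes A of R_i."""
    return num_rows * sum(widths.values())


def joining_width(widths_i, widths_j):
    """w(X_ij): total width of the attributes shared by R_i and R_j."""
    shared = widths_i.keys() & widths_j.keys()
    return sum(widths_i[a] for a in shared)


# Hypothetical attribute widths in bytes.
w1 = {"A1": 4, "A2": 8}
w2 = {"A2": 8, "A3": 16}
s1 = relation_size(100, w1)      # 100 * (4 + 8) = 1200
w_x12 = joining_width(w1, w2)    # only A2 is shared: 8
```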

Next, we define the terminology used in this paper.

Let $(R_i, r_i)$ and $(R_j, r_j)$ be two relation states and $X \subseteq R_i \cap R_j$.

Definition:

The equi-join of $R_i$ and $R_j$ on X, denoted by $R_i \bowtie_X R_j$, is
$\{t \mid t$ is a tuple over $R_i \cup R_j$ such that $t[R_i] \in r_i \wedge t[R_j] \in r_j\}$.

The semijoin of $R_i$ and $R_j$ on X, denoted by $R_i \ltimes_X R_j$, equals
$R_i \bowtie_X R_j[X]$, where $R_j[X]$ is the projection of $R_j$ on X.
Equivalently, it equals
$\{t_i \mid t_i \in r_i \wedge (\exists t_j \in r_j)(t_i[X] = t_j[X])\}$.

The natural join of $R_i$ and $R_j$, denoted by $R_i \bowtie R_j$, is the
join of $R_i$ and $R_j$ on $R_i \cap R_j$.

The natural semijoin of $R_i$ and $R_j$, denoted by $R_i \ltimes R_j$, is the
semijoin of $R_i$ and $R_j$ on $R_i \cap R_j$.
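The two natural operators can be sketched over relation states represented as lists of dictionaries (one dictionary per tuple, keyed by attribute name). This representation and the function names are assumptions made for illustration; they are not part of the paper's model.

```python
def agree_on_shared(t_i, t_j):
    """True if the two tuples carry equal values on every shared attribute."""
    return all(t_i[a] == t_j[a] for a in t_i.keys() & t_j.keys())


def natural_join(r_i, r_j):
    """Tuples over the union schema built from pairs that agree on shared attributes."""
    return [{**t_i, **t_j} for t_i in r_i for t_j in r_j if agree_on_shared(t_i, t_j)]


def natural_semijoin(r_i, r_j):
    """Tuples of r_i that match at least one tuple of r_j; the schema of r_i is kept."""
    return [t_i for t_i in r_i if any(agree_on_shared(t_i, t_j) for t_j in r_j)]


# Hypothetical relation states sharing the attribute B.
r1 = [{"A": 1, "B": 10}, {"A": 2, "B": 20}]
r2 = [{"B": 10, "C": "x"}, {"B": 30, "C": "y"}]
joined = natural_join(r1, r2)       # [{"A": 1, "B": 10, "C": "x"}]
reduced = natural_semijoin(r1, r2)  # [{"A": 1, "B": 10}]
```

The semijoin result has the schema of the first operand and only fewer tuples, which is why it is attractive as a data-reduction step before shipping a relation.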

Note that the join (resp. semijoin) operator is weaker than the
natural join (resp. natural semijoin) operator in that:

$R_i \bowtie R_j \subseteq R_i \bowtie_X R_j \quad \forall X \subseteq R_i \cap R_j$.

$R_i \ltimes R_j \subseteq R_i \ltimes_X R_j \quad \forall X \subseteq R_i \cap R_j$.

Definition:

A qualification q is called sub-natural iff for each clause
$R_i.A_k = R_j.A_l$ of q, $A_k = A_l$.

q is called natural iff the converse holds as well, i.e.
for all relation schemas $R_i$ and $R_j$, and for all $A_k \in R_i \cap R_j$,
$R_i.A_k = R_j.A_k$ is a clause of q.

Definition:

Given a database schema $D = \{R_1, R_2, \ldots, R_n\}$, a query is called
a natural join query (NJQ) (resp. sub-natural join query,
SNJQ) iff there exists a natural qualification (resp.
sub-natural qualification) for it and $TL \subseteq U(D)$.

As shown in [BG81], any query Q=(q,TL) with an equi-join
qualification q and a target list TL can be efficiently
transformed into an equivalent natural join query. Instead of
the class of equi-join queries, EQJ, we shall construct the query
processing model in terms of the class of natural join queries,
NJQ.


In a DDBMS, we define two types of directed operators.

Definition:

1. <|X|_ij (or $R_i$ <|X| $R_j$) is the distributed natural join
operator, which sends $R_j$ to $R_i$ and performs the natural join of
$R_i$ and $R_j$ at $R_i$'s site.

2. <|X_ij (or $R_i$ <|X $R_j$) is the directed natural semijoin
operator, which projects $R_j$ over $X = R_i \cap R_j$, sends the
result to $R_i$, and performs the join of $R_i$ and the result
at $R_i$'s site (i.e. $R_i \bowtie \pi_X R_j$ at $R_i$'s site).

Note that |X|>_ij = $R_i$ |X|> $R_j$ and X|>_ij = $R_i$ X|> $R_j$ are similarly
defined, with the arrow pointing at the receiving site. One can use
them interchangeably. The semijoin operation only reduces the
relation state without changing the relation schema.

Definition:

A join-semijoin program $P = p_1 p_2 \cdots p_l$ is a sequence of
distributed natural join and distributed natural semijoin
operators.

A natural join qualification q with final node $R_1$ can be
processed by sending all relations $R_i$, $i \neq 1$, to $R_1$ and
performing $R_1 \bowtie R_2 \bowtie \cdots \bowtie R_n$ at node $R_1$. So
$R_2$ |X|> $R_1$, $R_3$ |X|> $R_1$, ..., $R_n$ |X|> $R_1$, or any permutation
of it, is a join-semijoin program for this qualification q.

2.2  Query  Processing  Graph 

We define the processing graph of a qualification q over a
database schema $D = \{R_i\}_{i=1}^{n}$ to be a graph
$\langle V_q, A_q, B_q \rangle$ with two types of edges. $V_q$ is the set of
nodes, which is equal to D. $A_q$ is the set of semijoin edges,
$A_q = \{a_{ij} = (R_i, R_j) \mid R_i \cap R_j \neq \emptyset$ and $R_i \not\subseteq R_j\}$.
We denote such an edge by i ->- j, with one arrow on the edge.
$B_q = \{b_{ij} = (R_i, R_j) \in V_q \times V_q \mid i \neq j\}$ is the set of join
edges. We denote such an edge by i ->>- j, with two arrows on the edge.


Note that if $R_i \cap R_j = \emptyset$, then we cannot perform a semijoin
between $R_i$ and $R_j$, so $a_{ij}$ is not a semijoin edge. If
$R_i \subseteq R_j$, then $R_i = R_i \cap R_j$. The semijoin of $R_i$ to $R_j$,
$R_i$ X|> $R_j$, is then the same as the join of $R_i$ to $R_j$,
$R_i$ |X|> $R_j$. This operation is covered by the join edge $b_{ij}$.
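The edge sets of a processing graph follow mechanically from the relation schemas. The sketch below builds them for a hypothetical three-relation schema; the function name and the schema are illustrative assumptions, and relations are represented simply as sets of attribute names.

```python
def processing_graph(schemas):
    """Return (nodes, semijoin_edges, join_edges) for schemas: name -> set of attributes."""
    nodes = list(schemas)
    # a_ij exists when R_i and R_j share an attribute and R_i is not contained in R_j.
    semijoin_edges = [(i, j) for i in nodes for j in nodes
                      if i != j
                      and schemas[i] & schemas[j]
                      and not schemas[i] <= schemas[j]]
    # b_ij exists for every ordered pair of distinct nodes.
    join_edges = [(i, j) for i in nodes for j in nodes if i != j]
    return nodes, semijoin_edges, join_edges


# Hypothetical schema: R3 is contained in R2, so a_32 is covered by the join edge b_32.
schemas = {"R1": {"A", "B"}, "R2": {"B", "C"}, "R3": {"C"}}
nodes, a_edges, b_edges = processing_graph(schemas)
print(a_edges)  # [('R1', 'R2'), ('R2', 'R1'), ('R2', 'R3')]
```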


Example : 

R I  =  {  A ,  ,  A3 . A j . ) 

R^=  {  A^.  A  j  /  A^j.  t  A^  } 

Rj={  ,kg} 

The  processing  graph  of  the  natural  join  qualification  q  is: 


Without loss of generality, from now on we assume that the
final node of a query is node 1.

Definition:

A join-semijoin program P is correct for a natural join
qualification q if, after executing the program P, the
final node holds the new relation
$R_1' = R_1 \bowtie R_2 \bowtie \cdots \bowtie R_n$.




Lemma 1

Any directed path of edges in $B_q$ from $R_{k_0}$ to $R_{k_l}$,
$b_{k_0 k_1}, b_{k_1 k_2}, \ldots, b_{k_{l-1} k_l}$, will form the relation
$R_{k_0} \bowtie R_{k_1} \bowtie \cdots \bowtie R_{k_l}$ at node $R_{k_l}$.

Proof:

We prove this by induction on the length l of the path. If
l=1, then the path is $b_{k_0 k_1}$. After this operation we
have $R_{k_1} \leftarrow R_{k_0} \bowtie R_{k_1}$. In the case of length l, for
the first l-1 edges of this path,
$R_{k_{l-1}} = R_{k_0} \bowtie R_{k_1} \bowtie \cdots \bowtie R_{k_{l-1}}$
by the induction assumption. After performing $b_{k_{l-1} k_l}$, we
have $R_{k_l} = R_{k_{l-1}} \bowtie R_{k_l} = R_{k_0} \bowtie R_{k_1} \bowtie \cdots \bowtie R_{k_l}$.

Lemma 2

Any directed spanning tree toward node 1 of edges in $B_q$
will form $R_1 \bowtie R_2 \bowtie \cdots \bowtie R_n$ at node 1.

Proof:

By Lemma 1, any path from $R_i$ to $R_1$ will form a new
relation at node 1 by joining all relations in this path. A
directed spanning tree contains each node i exactly
once. If we execute the paths toward node 1 one by one,
we obtain the relation $R_1 \bowtie R_2 \bowtie \cdots \bowtie R_n$ at node
1.

Theorem 1

Let Q=(q,TL) be a natural join query and $TL = R_1 \cup \cdots \cup R_n$.
Let P be a join-semijoin program for q; then P is correct
iff there exists a subset of the set $\{b_{ij}\}$ in P which
forms an inversely directed spanning tree toward node $R_1$.
Proof:

IF: A natural semijoin only reduces a relation state without
losing any data needed for the answer, and does not change the
relation schema. So after performing the sequence of joins in the
directed spanning tree toward $R_1$, we get $R_1 \bowtie \cdots \bowtie R_n$
at node $R_1$. Any other join operation does not change this
state. This implies P is correct.

ONLY IF: For each semijoin operation $a_{ij}$ in P, $R_i \cap R_j \neq \emptyset$
and $R_i \not\subseteq R_j$. So performing the semijoin operations does not
move the full relation state from node $R_i$ to node $R_j$; we
still need to perform a join to move the full table of $R_i$
to $R_j$. If there does not exist a subset of $\{b_{ij}\}$ in P
which forms a directed spanning tree toward node $R_1$, then
the information of those nodes which have no path
toward $R_1$ will be lost.
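The condition of Theorem 1 can be checked by a reachability test: a subset of the join edges forms a spanning tree directed toward the final node exactly when every node has a directed path of join edges to it. The sketch below assumes a hypothetical encoding of a program as a list of `(kind, i, j)` triples, meaning an operation from $R_i$ to $R_j$; like the theorem as stated, it looks only at which join edges occur, not at their order.

```python
def is_correct(program, nodes, final=1):
    """Check the spanning-tree condition of Theorem 1 on the join edges of a program."""
    # Keep only join edges b_ij, stored in reverse: for each j, the senders i.
    incoming = {v: [] for v in nodes}
    for kind, i, j in program:
        if kind == "join":
            incoming[j].append(i)
    # Walk backwards from the final node; every node must be reached.
    seen, stack = {final}, [final]
    while stack:
        v = stack.pop()
        for u in incoming[v]:
            if u not in seen:
                seen.add(u)
                stack.append(u)
    return seen == set(nodes)


nodes = [1, 2, 3]
star = [("join", 3, 1), ("join", 2, 1)]            # both relations sent to node 1
chain = [("join", 3, 2), ("join", 2, 1)]           # R3 joins R2, result joins R1
missing = [("semijoin", 2, 1), ("join", 3, 1)]     # R2 never reaches node 1 in full
print(is_correct(star, nodes), is_correct(chain, nodes), is_correct(missing, nodes))
# True True False
```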


From Theorem 1, we know that the set of correct programs for an
NJQ qualification is the set of join-semijoin programs P such that
there exists a directed spanning tree toward $R_1$ among the
join edges in P. We denote the set of correct programs by $\mathcal{P}$.
The distributed query processing problem becomes that of finding a
program $P \in \mathcal{P}$ with minimum communication cost. The set of
correct programs is very large. Moreover, if we change the order of
the operations in a program P, the total communication cost will be
different: executing one operation in P changes the number of rows
and columns of some relations, and this change then affects the
communication cost of the next operation. So the communication cost
of one operation depends on the preceding subsequence of operations.

2.3  Estimate  the  Size  of  the  Derived  Relations 


In order to compare the communication costs of query
processing strategies, it is very important to have a method for
estimating the size of a relation after one operation. The
system for estimating the size of a derived relation must also be
consistent, in the sense that if two sequences of operations
produce the same result, the sizes of the result estimated
according to the two sequences must be the same.


We introduce the notion of semijoin reducibility and join
reducibility of $R_i$ to $R_j$, denoted by $\alpha_{ij}$ and $\beta_{ij}$
respectively, for each pair of relations $R_i$ and $R_j$, where
$0 \le \alpha_{ij} \le 1$ and $0 \le \beta_{ij} \le 1$.

The interpretation of the semijoin reducibility $\alpha_{ij}$ of $R_i$ to
$R_j$ is the fraction of the rows of $R_j$ that is removed by
performing the semijoin $R_i$ X|> $R_j$. At stage t, if the number of
rows of $R_j$ is $r_j[t-1]$ and the semijoin reducibility of $R_i$ to
$R_j$ is $\alpha_{ij}[t-1]$, then the number of rows of $R_j$ after
performing the semijoin $R_i$ X|> $R_j$ will be reduced to

$$r_j[t] = r_j[t-1] \cdot (1 - \alpha_{ij}[t-1]).$$

Note that the semijoin reducibility of $R_i$ to $R_j$ is not equal to
the semijoin reducibility of $R_j$ to $R_i$, and $\alpha_{ii}[t] = 0$ for
all t.

The interpretation of the join reducibility $\beta_{ij}$ of $R_i$ to $R_j$
is that after performing the join $R_i$ |X|> $R_j$, the number of rows
of the new relation $R_i \bowtie R_j$ at site j will be

$$r_j[t] = r_i[t-1] \cdot r_j[t-1] \cdot (1 - \alpha_{ij}[t-1]) \cdot (1 - \alpha_{ji}[t-1]) \cdot (1 - \beta_{ij}[t-1]).$$

This is because the effect of the join $R_i$ |X|> $R_j$ is equivalent to
performing the semijoins $R_i$ X|> $R_j$ and $R_j$ X|> $R_i$ and then
performing the join of $R_i$ to $R_j$. Both semijoin reducibilities
and the join reducibility affect the number of rows of the new
relation. The join reducibility of $R_i$ to $R_j$ is the same as the
join reducibility of $R_j$ to $R_i$, i.e. $\beta_{ij}[t] = \beta_{ji}[t]$.
Also $\beta_{ii}[t] = 0$ for all t. For this paper, we assume
the set of reducibilities $\{\alpha_{ij}, \beta_{ij}\}$ can be known in advance
by some statistical measurement.
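The two row-count rules above translate directly into code. The following is a minimal sketch; the function names and the sample reducibility values are hypothetical.

```python
def rows_after_semijoin(r_j, alpha_ij):
    """r_j[t] after the semijoin R_i X|> R_j: a fraction alpha_ij of R_j is removed."""
    return r_j * (1 - alpha_ij)


def rows_after_join(r_i, r_j, alpha_ij, alpha_ji, beta_ij):
    """r_j[t] after the join R_i |X|> R_j: both semijoin effects, then the join effect."""
    return r_i * r_j * (1 - alpha_ij) * (1 - alpha_ji) * (1 - beta_ij)


# Hypothetical values: R_i has 100 rows, R_j has 1000 rows.
reduced = rows_after_semijoin(1000, alpha_ij=0.5)
joined = rows_after_join(100, 1000, alpha_ij=0.5, alpha_ji=0.25, beta_ij=0.75)
print(reduced, joined)  # 500.0 9375.0
```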


Since the number of rows and columns of a relation changes
after an operation, the reducibilities of this relation
with other relations change too. We define how the
reducibilities change after one operation. Assume the
database state before the operation $p_t$ is
$D = \{R_1[t-1], \ldots, R_n[t-1]\}$, the number of rows of each relation
$R_i[t-1]$ is $r_i[t-1]$, and the semijoin and join reducibilities of
$R_i[t-1]$ to $R_j[t-1]$ are $\alpha_{ij}[t-1]$ and $\beta_{ij}[t-1]$.

If the operation $p_t$ at stage t is $a_{ij}$, i.e. $R_i$ X|> $R_j$, then
the database schema remains the same and only the number of
rows of relation $R_j[t]$ changes, to $r_j[t-1] \cdot (1 - \alpha_{ij}[t-1])$;
the number of rows of all other relations remains the same.

Since this semijoin operation $a_{ij}$ reduces the number of
rows of $R_j$, the semijoin reducibility of $R_j$ with all other
relations is reduced too. The semijoin reducibility $\alpha_{hk}[t]$ of
$R_h[t]$ with all other relations $R_k[t]$ changes according to the
following rules:


0 

ii-/ 

•<**  ct-o 

***• 

***  Lt-O+c^ 
The semijoin operation does not affect the join reducibilities.
So all join reducibilities at this stage stay the same as at the
last stage, i.e.

$\beta_{hk}[t] = \beta_{hk}[t-1]$ for all h and k.

If the operation $p_t$ at stage t is $b_{ij}$, i.e. $R_i$ |X|> $R_j$, then
the database schema at node j, $R_j[t]$, changes to
$R_i[t-1] \cup R_j[t-1]$, and the relation state at site j becomes
$R_i[t-1] \bowtie R_j[t-1]$. The number of rows of $R_j[t]$, $r_j[t]$, equals

$$r_i[t-1] \cdot r_j[t-1] \cdot (1 - \alpha_{ij}[t-1]) \cdot (1 - \alpha_{ji}[t-1]) \cdot (1 - \beta_{ij}[t-1]).$$

All other relation states remain the same. Because this is a join
operation, both the semijoin reducibilities and the join
reducibilities are affected.

The semijoin reducibility of $R_h[t]$ to $R_k[t]$ changes as
follows:

^  0  /  4t-  7 

oTwrt-a+ 

oC^Ct-0  o  bruise. 

V. 

The join reducibility of $R_h[t]$ to $R_k[t]$ changes as
follows:


\ 


cf  rA-ct-o«o 


2.4  Consistent  Parameter  System 


We say that a parameter system is consistent if it
produces the same estimates of the size of the results
whenever two programs really produce the same results.
The parameter system which we use to estimate the size of the
derived relations has the parameters
$\{r_i[t], \alpha_{ij}[t], \beta_{ij}[t]\}$.

Theorem:

The parameter system $\{r_i[t], \alpha_{ij}[t], \beta_{ij}[t]\}$ we defined
above is consistent.

Proof: This is because a semijoin operation or a join operation
can affect the size of the derived relations only once, and
the rules for updating the reducibilities reflect this
fact. So the order of the operations does not affect the
estimate of the size of the derived relation.


Example:

Suppose we have three relations $R_1$, $R_2$ and $R_3$, where
$R_i \cap R_j \neq \emptyset$ and $R_i \not\subseteq R_j$ for all i, j, and suppose
$\alpha_{ij}[0]$ and $\beta_{ij}[0]$ are given. Then the processing graph
contains all semijoin edges $a_{ij}$ and all join edges $b_{ij}$, $i \neq j$.

Let $P_1$ be a program consisting of two joins and $P_2$ a program
consisting of two semijoins followed by two joins. The two programs
produce the same result $R_1[0] \bowtie R_2[0] \bowtie R_3[0]$ at
site 1. By the rules for estimating the size of the
derived relation, the estimated sizes of
$R_1[0] \bowtie R_2[0] \bowtie R_3[0]$ derived by these two programs are
the same, namely

$$\prod_{i=1}^{3} r_i[0] \cdot \prod_{i \neq j} (1 - \alpha_{ij}[0]) \cdot \prod_{i<j} (1 - \beta_{ij}[0]).$$
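The order-independent estimate for the full join can be evaluated directly from the initial parameters. The sketch below assumes the product form given in the example (row counts, one factor per ordered pair for the semijoin reducibilities, one factor per unordered pair for the join reducibilities); the function name and the numeric values are hypothetical.

```python
from itertools import combinations, permutations
from math import prod


def estimated_full_join_rows(r, alpha, beta):
    """Estimated rows of R_1 |X| ... |X| R_n from the initial parameters."""
    nodes = sorted(r)
    size = prod(r[i] for i in nodes)
    # One (1 - alpha_ij) factor for every ordered pair i != j.
    size *= prod(1 - alpha[i, j] for i, j in permutations(nodes, 2))
    # One (1 - beta_ij) factor for every unordered pair i < j.
    size *= prod(1 - beta[i, j] for i, j in combinations(nodes, 2))
    return size


# Hypothetical parameters for three relations.
r = {1: 10, 2: 20, 3: 40}
alpha = {(i, j): 0.5 for i, j in permutations(r, 2)}
beta = {(i, j): 0.5 for i, j in combinations(sorted(r), 2)}
print(estimated_full_join_rows(r, alpha, beta))  # 15.625
```

Because the expression depends only on the initial parameters, any two correct programs that the update rules reduce to this form receive the same estimate.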

2.5  Problem  Formulation 


In order to write down the mathematical formulation of the
distributed query optimization problem, we need to know the cost
function of each operation. From our earlier assumption of a linear
cost function, we can write down the cost function of each type
of operation at stage t. The projection of $R_i$ over $R_i \cap R_j$ may
make the number of rows of $R_i$ smaller after compressing. Here
we ignore this fact. We assume the number of rows after the
projection of $R_i$ over $R_i \cap R_j$ is equal to:

$$N_{ij} = \min\Big\{ r_i[t], \prod_{A \in X_{ij}} |dom(A)| \Big\}$$

The cost of operation $a_{ij}$ will be

$$Cost(a_{ij}) = c_{ij} \cdot \Big( N_{ij} \cdot \sum_{A \in X_{ij}} w(A) \Big)$$

and the cost of operation $b_{ij}$ will be

$$Cost(b_{ij}) = c_{ij} \cdot \Big( r_i[t] \cdot \sum_{A \in R_i[t]} w(A) \Big).$$
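The two cost functions can be sketched as follows. The function names are hypothetical, and the domain sizes, widths and unit cost are illustrative values only.

```python
from math import prod


def projected_rows(r_i, domain_sizes):
    """N_ij = min(r_i, product of |dom(A)| over the joining attributes)."""
    return min(r_i, prod(domain_sizes))


def cost_semijoin(c_ij, r_i, domain_sizes, joining_widths):
    """Cost(a_ij): ship the projection of R_i on the joining attributes."""
    return c_ij * projected_rows(r_i, domain_sizes) * sum(joining_widths)


def cost_join(c_ij, r_i, widths):
    """Cost(b_ij): ship every row of R_i at its full width."""
    return c_ij * r_i * sum(widths)


# Hypothetical R_i: 1000 rows, attribute widths 4, 2 and 10 bytes; the first two
# attributes are the joining attributes, with domains of 50 and 4 values.
semi_cost = cost_semijoin(c_ij=2, r_i=1000, domain_sizes=[50, 4], joining_widths=[4, 2])
join_cost = cost_join(c_ij=2, r_i=1000, widths=[4, 2, 10])
print(semi_cost, join_cost)  # 2400 32000
```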


Based on the distributed query processing model we
developed, the formulation of the distributed query optimization
problem is as follows:

INPUT:

1. a distributed database schema $D = \{R_1[0], \ldots, R_n[0]\}$

2. the width $w(A)$ of each attribute A in U(D)

3. the number of rows, $r_i[0]$, of each relation $R_i$

4. the semijoin reducibility $\alpha_{ij}[0]$ of each pair of
relations $R_i$ and $R_j$, with $\alpha_{ii}[0] = 0$

5. the join reducibility $\beta_{ij}[0]$ of each pair of relations $R_i$
and $R_j$, with $\beta_{ii}[0] = 0$ and $\beta_{ij}[0] = \beta_{ji}[0]$.

OBJECTIVE:

Find an optimal join-semijoin program to solve the
natural join query.

Let $P = p_1 p_2 \cdots p_l$; then the problem is to minimize
$\sum_{i=1}^{l} cost(p_i)$ according to the rules for updating the
parameters and the cost functions defined above.


3.  Conclusion 


We have developed a mathematical model of the distributed query
processing problem for a class of equi-join queries. We also
defined rules for estimating the size of a derived relation. The
parameter system based on those rules is consistent. Future
research will be to develop algorithms for solving this problem.
The difficulty in solving this problem is that the cost of one
operation depends on the size of the derived relation which
results from the previous operations. A special
case of this problem has been shown [HUA81] to be NP-complete.
An efficient optimal algorithm for this problem therefore
seems unlikely.

